DATA SCIENCE SESSIONS VOL. 3

A Foundational Python Data Science Course

Session 13: Simple Linear Regression. Estimation Theory continued: the Parametric bootstrap.

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


1. Simple Linear Regression

We will use the Fish.csv data set in this session. You can grab it from Kaggle: fish-market. Please place the Fish.csv data set into your _data directory.

Target: predict Weight from Height

The linear model has the form

$$y = \beta_1 x + \beta_0 + \varepsilon,$$

where $x$ is the predictor, $y$ is the target, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the random error term.

The predicted value $\hat{y}$ of the target variable is computed via linear regression as

$$\hat{y} = \beta_1 x + \beta_0.$$

OK, statsmodels can do it; but how do we find the optimal values of $\beta_0$ and $\beta_1$ ourselves? Let's build a function that takes some particular values of $\beta_0$ and $\beta_1$ for a particular regression problem (i.e. for a particular data set) and returns the model error.

The model error? Oh. Remember the residuals:

$$\varepsilon_i = y_i - \hat{y}_i$$

where $y_i$ is the observed value to be predicted, and $\hat{y}_i$ is the model's prediction?

Next we do something similar to what happens in the computation of variance, square the differences:

$$\varepsilon_i^2 = (y_i - \hat{y}_i)^2$$

and define the model error for all observations to be the sum of squares:

$$SSE = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

Obviously, the lower the $SSE$ - the Sum of Squared Errors - the better the model! Here's a function that returns the SSE for a given data set (with two columns: the predictor and the criterion) and a choice of parameters $\beta_0$ and $\beta_1$:
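One way such a function might look, as a NumPy sketch: the name `lg_sse`, the two-column array layout (predictor first, criterion second), and the tiny synthetic data set standing in for Fish.csv are all assumptions here.

```python
import numpy as np

def lg_sse(data, beta_0, beta_1):
    """Return the SSE of the simple linear model y_hat = beta_1 * x + beta_0.

    `data` is a 2-D array with the predictor in column 0
    and the criterion (target) in column 1.
    """
    x = data[:, 0]
    y = data[:, 1]
    y_hat = beta_1 * x + beta_0
    return np.sum((y - y_hat) ** 2)

# A tiny synthetic data set lying exactly on the line y = 2x + 1:
data = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
print(lg_sse(data, beta_0=1.0, beta_1=2.0))  # 0.0 for the true parameters
```

For the true parameters the residuals vanish and the SSE is exactly zero; any other choice of $\beta_0$, $\beta_1$ yields a strictly positive SSE.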

Test lg_sse() now:

Check via statsmodels:

Method A. Random parameter space search
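A sketch of the idea, on synthetic data standing in for Fish.csv: draw many random $(\beta_0, \beta_1)$ pairs and keep the pair with the smallest SSE. The search range $[-10, 10]$ and the count of 10,000 pairs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

def lg_sse(x, y, beta_0, beta_1):
    return np.sum((y - (beta_1 * x + beta_0)) ** 2)

# Draw random (beta_0, beta_1) pairs and keep the pair with the lowest SSE
n_trials = 10_000
beta_0s = rng.uniform(-10, 10, size=n_trials)
beta_1s = rng.uniform(-10, 10, size=n_trials)
sses = np.array([lg_sse(x, y, b0, b1) for b0, b1 in zip(beta_0s, beta_1s)])
best = np.argmin(sses)
print(beta_0s[best], beta_1s[best])  # roughly 1 and 2
```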

Check with statsmodels:

Not bad, how about 100,000 random pairs?

Method B. Grid search
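A sketch of the grid search on the same kind of synthetic data: evaluate the SSE on a regular grid of $(\beta_0, \beta_1)$ values and pick the grid cell with the smallest value. The $101 \times 101$ grid over $[-10, 10]^2$ is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

# A regular grid of candidate (beta_0, beta_1) values, spacing 0.2
beta_0_grid = np.linspace(-10, 10, 101)
beta_1_grid = np.linspace(-10, 10, 101)
B0, B1 = np.meshgrid(beta_0_grid, beta_1_grid)

# Broadcasting: compute the SSE for every grid cell at once
resid = y[None, None, :] - (B1[..., None] * x[None, None, :] + B0[..., None])
SSE = (resid ** 2).sum(axis=-1)

i, j = np.unravel_index(np.argmin(SSE), SSE.shape)
print(B0[i, j], B1[i, j])  # near 1 and 2
```

The precision of the result is limited by the grid spacing, which motivates the denser grid below.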

A grid more dense:

Check with statsmodels:

Method C. Optimization (the real thing)

The Method of Least Squares

Here is the real thing.

And how do we do that?

Well, in the particular case of a (Simple or Multiple) Linear Regression Model, it turns out that it is possible to provide an analytical solution for all model parameters that minimize the model's $SSE$ (error) function. It takes some time to work through the partial derivatives of $SSE$ with respect to each model parameter, but it works out in the end.

But finding an analytical solution will not work for just any statistical model.

Now, imagine that we have an algorithm - call it an optimization algorithm - that can find the parameters that minimize a given function. Indeed we have such an algorithm. In fact, we have many different such algorithms, developed by experts in the very much alive and complicated branch of mathematics called Optimization Theory. We will put one such algorithm - the famed Nelder-Mead Simplex Method - to work in order to minimize $SSE$ with respect to $\beta_0$ and $\beta_1$.
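A sketch of this with `scipy.optimize.minimize`, again on synthetic data standing in for Fish.csv; the starting point `[0, 0]` is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

def sse(params):
    """The objective function: SSE as a function of (beta_0, beta_1)."""
    beta_0, beta_1 = params
    return np.sum((y - (beta_1 * x + beta_0)) ** 2)

# Nelder-Mead needs only function evaluations, no derivatives
result = minimize(sse, x0=[0.0, 0.0], method="Nelder-Mead")
print(result.x)    # approximately [1, 2]
print(result.fun)  # the minimized SSE
```

Note that Nelder-Mead never touches a derivative of the objective function, which is exactly why it generalizes to models where no analytical solution exists.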

Check against statsmodels

Final value of the objective function (the model SSE, indeed):

Check against statsmodels

Error Surface Plot: The Objective Function

This is the function that we have minimized:

Back to statsmodels

Linear Regression using scikit-learn
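A minimal scikit-learn version, once more on synthetic data standing in for the fish measurements; note that scikit-learn expects the features as a 2-D matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))  # 2-D feature matrix, one column
y = 2.0 * x[:, 0] + 1.0 + rng.normal(0, 0.5, size=100)

reg = LinearRegression().fit(x, y)
print(reg.intercept_, reg.coef_[0])  # close to 1 and 2
```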

2. Parametric Bootstrap

Bias and Variance via the Bootstrap

We will now begin working with resampling methods in statistics.

The first resampling method that we will consider is the Parametric Bootstrap. It is magic, believe me.

Do you remember from our previous discussions of Estimation Theory what the Bias and the Variance of a statistical estimator are?

Is there a method to estimate the Bias of a statistical estimator, at all? For example: what are the biases of $\beta_0$, $\beta_1$ in our Simple Linear Regression model?

First: the model parameters and their standard errors

Second: the standard deviation of model residuals

Third: the Sim-Fit Loop, Parametric Bootstrap
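The Sim-Fit loop might be sketched as follows: fit the model once, then repeatedly (a) simulate new data from the fitted model and (b) refit the model to the simulated data. Synthetic data stand in for Fish.csv, and `np.polyfit` plays the role of the regression fit.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

# First: fit the model once; np.polyfit returns [beta_1, beta_0]
beta_1_hat, beta_0_hat = np.polyfit(x, y, deg=1)

# Second: the standard deviation of the residuals
residuals = y - (beta_1_hat * x + beta_0_hat)
sigma_hat = np.std(residuals, ddof=2)  # ddof=2: two parameters estimated

# Third: the Sim-Fit loop - simulate from the fitted model, refit
B = 1000
boot_params = np.empty((B, 2))         # rows: [beta_1, beta_0] per replicate
for r in range(B):
    y_sim = beta_1_hat * x + beta_0_hat + rng.normal(0, sigma_hat, size=x.size)
    boot_params[r] = np.polyfit(x, y_sim, deg=1)

# Bootstrap standard errors of beta_1 and beta_0
print(boot_params.std(axis=0))
```

The column-wise standard deviations of `boot_params` are the bootstrap estimates of the standard errors of $\beta_1$ and $\beta_0$.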

Compare with model parameter variances as estimated from the original Linear Regression

$$\hat{Bias}_{boot} = \frac{1}{B} \sum_{r=1}^B \hat{\theta}^{*(r)} - \hat{\theta},$$

where $\hat{\theta}$ is the original estimator, $\hat{\theta}^{*(r)}$ is the estimate of $\hat{\theta}$ based on the $r$-th bootstrap sample, and $B$ is the number of bootstrap samples.
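A minimal numerical illustration of this estimator, with made-up bootstrap replicates rather than the regression above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical original estimate and B = 5000 bootstrap replicates of it
theta_hat = 2.0
theta_boot = theta_hat + rng.normal(0.0, 0.1, size=5000)

# Bootstrap bias estimate: mean of the replicates minus the original estimate
bias_boot = theta_boot.mean() - theta_hat
print(bias_boot)  # close to zero, since the replicates are centred on theta_hat
```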

Or, in a purely theoretical way, assuming that the full distribution of all possible bootstrap samples could ever be known:

$$Bias_{boot} = E_{\mathcal{F}^*}[\hat{\theta}^{*}] - \hat{\theta},$$

where $\hat{\theta}$ is the original estimator, $\hat{\theta}^{*}$ is the estimate of $\hat{\theta}$ based on a single bootstrap sample, and $E_{\mathcal{F}^*}$ denotes the expected value with respect to the distribution of bootstrap samples $\mathcal{F}^*$.

In this formula, the absence of a hat symbol above $Bias_{boot}$ indicates that this is the true bias, rather than an estimate of the bias.

Further Reading


DataKolektiv, 2022/23.

hello@datakolektiv.com

License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.